Our dataset is the red wine dataset, which contains 1599 observations of 13 columns: a row index (X), 11 variables on the chemical properties of the wine, and one outcome variable, quality, which rates each wine from 0 (very bad) to 10 (very excellent). The aim of our exploration is to answer an important question: which chemical properties influence the quality of red wines?
## [1] 1599 13
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Since I have only one categorical variable (quality), let’s derive two more categorical variables from existing variables. The first new variable, sweetness_range, takes two values, high and low, and is derived from the residual.sugar variable. The mean is the cut-off: a value >= mean(residual.sugar) is considered high, and a value below mean(residual.sugar) is considered low. The second new variable, alcohol_range, follows the same idea: a value >= mean(alcohol) is considered high, and a value below mean(alcohol) is considered low.
# Create a new categorical variable called sweetness_range
sugar_mean <- mean(redw$residual.sugar)
redw$sweetness_range <- ifelse(redw$residual.sugar >= sugar_mean, "high", "low")
# Create a new categorical variable called alcohol_range
alcohol_mean <- mean(redw$alcohol)
redw$alcohol_range <- ifelse(redw$alcohol >= alcohol_mean, "high", "low")
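As an alternative sketch of the same derivation, `cut()` returns a factor with an explicit level order, which keeps "low" before "high" in tables and plots. The helper name `split_at_mean` and the toy data frame are hypothetical and only stand in for `redw`; they are not part of the original analysis.

```r
# Hypothetical helper: label values as "low"/"high" relative to their mean.
# Values equal to the mean count as "high", matching the >= rule above.
split_at_mean <- function(x) {
  cut(x,
      breaks = c(-Inf, mean(x), Inf),
      labels = c("low", "high"),
      right = FALSE)  # intervals are [a, b), so x == mean(x) falls in "high"
}

# Toy data standing in for redw (values are made up for illustration)
toy <- data.frame(residual.sugar = c(1.9, 2.6, 2.3, 6.1, 1.2),
                  alcohol        = c(9.4, 9.8, 10.0, 10.5, 12.3))

toy$sweetness_range <- split_at_mean(toy$residual.sugar)
toy$alcohol_range   <- split_at_mean(toy$alcohol)

# Counts per level, in the order low, high
table(toy$sweetness_range)
table(toy$alcohol_range)
```

A factor also makes it harder to introduce a misspelled level later, since assigning an unknown label produces NA with a warning instead of silently creating a third category.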
Let’s start by exploring the distributions of individual variables.
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
In our dataset no wine is rated 9 or 10, and none is rated below 3. Most wines have a medium rating (5 or 6), followed by a rating of 7.
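To put a number on how dominant the medium ratings are, the counts from the table above can be converted to shares. This is a small sketch that retypes the printed counts by hand; the variable names are illustrative.

```r
# Quality counts copied from the table above
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Share of wines at each rating, in percent
quality_share <- round(100 * quality_counts / sum(quality_counts), 1)
quality_share

# Combined share of the medium ratings (5 and 6)
medium_share <- sum(quality_counts[c("5", "6")]) / sum(quality_counts)
round(100 * medium_share, 1)
```

On the full data frame the same shares would come from `prop.table(table(redw$quality))`.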
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The fixed acidity distribution peaks between 7.0 and 8.0, and the middle half of the wines lies between 7.1 and 9.2.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The pH distribution is close to normal, peaking around 3.3, with some outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
The distribution peaks at 0.0 g of citric acid, and there is an outlier equal to 1.0 g.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Alcohol is skewed to the right, with the peak between about 9.0 and 9.5; the median is 10.2.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
The density distribution is bell-shaped and close to normal, with a mean of 0.9967.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Residual sugar has a long-tailed distribution, so I transformed it using log10. Most of the observations are between 2.0 and 2.5, and there are many outliers, reaching up to about 15.0.
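The effect of the log10 transformation on a long-tailed variable can be sketched with simulated data. The values below are drawn from a log-normal distribution as a stand-in for residual.sugar; they are not the real measurements, and the parameters are assumptions chosen only to give a similar shape.

```r
set.seed(1)  # make the simulated values reproducible

# Simulated right-skewed values standing in for residual.sugar (not the real data)
sugar <- rlnorm(500, meanlog = log(2.2), sdlog = 0.4)

# Histogram bins on the raw scale and on the log10 scale, computed without drawing
raw_hist <- hist(sugar, breaks = 30, plot = FALSE)
log_hist <- hist(log10(sugar), breaks = 30, plot = FALSE)

# Skew check: for a right-skewed variable the mean tends to exceed the median;
# after log10 the two are usually much closer
c(raw   = mean(sugar) - median(sugar),
  log10 = mean(log10(sugar)) - median(log10(sugar)))

# To draw the transformed histogram:
# plot(log_hist, main = "log10(residual.sugar)")
```

With ggplot2, the equivalent step is usually to keep the raw variable and add `scale_x_log10()` to the histogram, which leaves the axis labels in the original units.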
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Sulphates has a right-skewed distribution, so I transformed it using log10. It peaks around 0.7, and there are outliers (around 2.0).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The chlorides variable also has a long tail, so I transformed it using log10. It peaks at around 0.07.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The mean of volatile.acidity is around 0.5.
##
## high low
## 435 1164
As we can see, most of our samples (1164 of 1599) fall in the low sweetness range.
##
## high low
## 683 916
The low alcohol range has a higher count (916) than the high alcohol range (683).
The red wine dataset contains 1599 samples with 13 columns: a row index (X), 11 chemical properties of the wine, and quality. The quality variable is a rating from 0 (very bad) to 10 (very excellent).
Other observations:
- Most wines have a medium quality rating (5-6).
- The median pH is about 3.3.
- The most common citric acid value is 0.
- Around 75% of wines have residual sugar below 2.6.
The main feature of this dataset that I’m interested in is quality, and how the chemical properties influence it. What makes a wine get a high quality rating?
I think all other features and variables will support my investigation about wine quality.
Yes, I created two new variables: sweetness_range, derived from the residual.sugar variable, and alcohol_range, derived from the alcohol variable.
I found many right-skewed, long-tailed distributions (such as residual sugar and chlorides), and I adjusted the bin width and applied a log10 transformation to get better visualizations.
It is time to explore relationships between two variables.
After a general look using the scatterplot matrices, let’s explore the interesting relationships between our variables.
Is there a relationship between alcohol and quality of wine?
## redw$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.575 11.000
## --------------------------------------------------------
## redw$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## redw$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## redw$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## redw$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## redw$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
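It looks like there is a positive relationship: as alcohol increases, the quality of the wine increases. The median alcohol rises from about 9.9 at quality 3 to about 12.2 at quality 8, although quality 5 has the lowest median (9.7).
Let’s explore another variable: volatile.acidity vs. quality.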
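Grouped summaries of this shape can be produced with `by()`, and the direction of the relationship can be checked with a rank correlation. The sketch below runs on a small made-up data frame standing in for `redw`. Spearman's method is my choice here because quality is an ordinal rating; the original analysis does not state which correlation it used, if any.

```r
# Toy data standing in for redw (values are made up for illustration)
toy <- data.frame(
  quality = c(3, 4, 5, 5, 6, 6, 7, 7, 8, 8),
  alcohol = c(9.0, 9.5, 9.4, 9.8, 10.5, 10.2, 11.5, 11.0, 12.1, 12.8)
)

# Per-quality summaries, in the same shape as the output above
by(toy$alcohol, toy$quality, summary)

# Median alcohol per quality level, as a named vector
median_by_quality <- tapply(toy$alcohol, toy$quality, median)
median_by_quality

# Spearman rank correlation: positive when alcohol tends to rise with quality
rho <- cor(toy$alcohol, toy$quality, method = "spearman")
rho
```

On the real data the call would be `by(redw$alcohol, redw$quality, summary)`, and the same pattern applies to volatile.acidity and chlorides.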
## redw$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4400 0.6475 0.8450 0.8845 1.0100 1.5800
## --------------------------------------------------------
## redw$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.230 0.530 0.670 0.694 0.870 1.130
## --------------------------------------------------------
## redw$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.180 0.460 0.580 0.577 0.670 1.330
## --------------------------------------------------------
## redw$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.3800 0.4900 0.4975 0.6000 1.0400
## --------------------------------------------------------
## redw$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4039 0.4850 0.9150
## --------------------------------------------------------
## redw$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2600 0.3350 0.3700 0.4233 0.4725 0.8500
There seems to be a negative relationship: as volatile acidity decreases, the quality increases. The median falls from about 0.85 at quality 3 to 0.37 at qualities 7 and 8.
Now it is the turn of the third variable: chlorides vs. quality.
## redw$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0610 0.0790 0.0905 0.1225 0.1430 0.2670
## --------------------------------------------------------
## redw$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04500 0.06700 0.08000 0.09068 0.08900 0.61000
## --------------------------------------------------------
## redw$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03900 0.07400 0.08100 0.09274 0.09400 0.61100
## --------------------------------------------------------
## redw$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03400 0.06825 0.07800 0.08496 0.08800 0.41500
## --------------------------------------------------------
## redw$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.06200 0.07300 0.07659 0.08700 0.35800
## --------------------------------------------------------
## redw$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04400 0.06200 0.07050 0.06844 0.07550 0.08600
Lower chloride levels go with higher quality: the median falls from about 0.09 at quality 3 to about 0.07 at quality 8. There is a negative relationship: as chlorides decreases, the quality increases.
Let’s look at how alcohol and residual.sugar influence quality, using our new categorical variables.
##
## high low
## 683 916
## redw$quality: 3
##
## high low
## 3 7
## --------------------------------------------------------
## redw$quality: 4
##
## high low
## 20 33
## --------------------------------------------------------
## redw$quality: 5
##
## high low
## 137 544
## --------------------------------------------------------
## redw$quality: 6
##
## high low
## 335 303
## --------------------------------------------------------
## redw$quality: 7
##
## high low
## 172 27
## --------------------------------------------------------
## redw$quality: 8
##
## high low
## 16 2
##
## high low
## 435 1164
## redw$quality: 3
##
## high low
## 3 7
## --------------------------------------------------------
## redw$quality: 4
##
## high low
## 15 38
## --------------------------------------------------------
## redw$quality: 5
##
## high low
## 193 488
## --------------------------------------------------------
## redw$quality: 6
##
## high low
## 156 482
## --------------------------------------------------------
## redw$quality: 7
##
## high low
## 62 137
## --------------------------------------------------------
## redw$quality: 8
##
## high low
## 6 12
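These counts support what we saw in the alcohol vs. quality scatter plot: there is a positive relationship, and the share of high-alcohol wines rises from about 20% at quality 5 to about 89% at quality 8. For residual sugar the pattern is much weaker. Low-sweetness wines account for roughly 67% to 76% of every quality level, so these counts alone do not show a clear negative relationship between sweetness and quality.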
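Because the quality levels differ so much in size (10 wines at quality 3 against 681 at quality 5), raw counts are hard to compare, and row-wise shares are easier to read. The sketch below retypes the counts printed above into two matrices; the object names are illustrative.

```r
# Counts copied from the by() output above: one row per quality level
alcohol_counts <- matrix(
  c(3, 7,  20, 33,  137, 544,  335, 303,  172, 27,  16, 2),
  ncol = 2, byrow = TRUE,
  dimnames = list(quality = as.character(3:8),
                  alcohol_range = c("high", "low"))
)
sweetness_counts <- matrix(
  c(3, 7,  15, 38,  193, 488,  156, 482,  62, 137,  6, 12),
  ncol = 2, byrow = TRUE,
  dimnames = list(quality = as.character(3:8),
                  sweetness_range = c("high", "low"))
)

# Row-wise shares in percent: each quality level sums to 100
alcohol_share   <- round(100 * prop.table(alcohol_counts, margin = 1), 1)
sweetness_share <- round(100 * prop.table(sweetness_counts, margin = 1), 1)
alcohol_share
sweetness_share
```

On the full data frame, `prop.table(table(redw$quality, redw$alcohol_range), margin = 1)` gives the same shares without retyping the counts.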
Let’s now explore some relationships among the supporting variables, setting aside our outcome variable (quality).
The scatter plot shows a negative relationship between fixed.acidity and pH: as pH decreases, fixed acidity increases.
There is a positive relationship between fixed.acidity and citric.acid.
There seems to be no clear relationship between chlorides and citric acid: chlorides stays roughly constant as citric acid increases.
Most of our samples have chlorides just below 0.1 and residual sugar around 2.0.
During my investigation I found both positive and negative relationships. Quality has a positive relationship with alcohol. However, quality has a negative relationship with volatile acidity and chlorides; for residual sugar, any negative relationship is weak at best.
I plotted many variables against each other but found no clear relationship, except in two plots: there is a negative relationship between fixed acidity and pH, and a positive relationship between fixed acidity and citric acid.
Alcohol is positively correlated with the quality rating: as one increases, so does the other. This was confirmed using two different plot types.
Let’s now explore three or more variables against each other.
The majority of our samples fall in the low sweetness range. The high quality ratings (7-8) tend to show a high alcohol range with low residual sugar.
Let’s explore quality vs. pH against alcohol range and sweetness range.
High-quality wines with a high alcohol range and low sweetness range have a pH of around 3.2 to 3.3, while lower-quality wines have a pH of around 3.3 to 3.5.
Let’s explore quality vs. citric acid against alcohol range and sweetness range.
The lowest quality rating (3) has the lowest citric acid, almost 0.0, while the highest has citric acid around 0.4. The majority of wines rated 8 have high alcohol and low residual sugar, with around 0.4 to 0.5 g of citric acid.
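The plot readings in this section can be cross-checked numerically by computing a median for each combination of quality and a derived category. The sketch below uses `aggregate()` on a small made-up data frame standing in for `redw`. The values are invented for illustration, so the output shows the method only and says nothing about the real wines.

```r
# Toy data standing in for redw (values are made up for illustration)
toy <- data.frame(
  quality       = c(3, 3, 5, 5, 5, 5, 8, 8),
  alcohol_range = c("low", "low", "low", "low", "high", "high", "high", "high"),
  pH            = c(3.50, 3.40, 3.35, 3.30, 3.32, 3.28, 3.25, 3.20),
  citric.acid   = c(0.00, 0.02, 0.20, 0.26, 0.30, 0.24, 0.45, 0.50)
)

# Median pH and citric acid for each quality / alcohol_range combination;
# only combinations that occur in the data get a row
medians <- aggregate(cbind(pH, citric.acid) ~ quality + alcohol_range,
                     data = toy, FUN = median)
medians
```

Adding `+ sweetness_range` to the right-hand side of the formula would split the groups by both derived variables at once.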
A high quality rating in most cases goes with high alcohol and low residual sugar. In addition, such wines have a pH of around 3.2 to 3.3 and 0.4 to 0.5 g of citric acid.
Wines with the lowest quality rating (3) have citric acid of almost 0.0 in both categories of alcohol_range and sweetness_range.
No
The most significant feature of this plot is how hard it seems to be to get a high quality rating: the majority of our samples have a medium rating (5-6). We also notice that no wine is rated 9 or 10!
The most significant feature of these plots is that quality increases as alcohol increases, and appears to increase as residual sugar decreases.
- quality vs. alcohol: positive relationship
- quality vs. residual sugar: negative relationship
The most significant feature of these plots is that wines with the lowest quality rating in our data (3) have citric acid of almost 0.0, and this holds for both categorical variables, alcohol_range and sweetness_range, while high-quality wines have around 0.4 to 0.5 g of citric acid.
The red wine dataset contains 1599 observations and 13 columns: a row index, 11 variables on the chemical properties of the wine, and the quality rating, which is the only categorical variable. During my exploration there were no serious problems, though I struggled somewhat to understand the chemical terms. Since this dataset contains only one categorical variable, I derived two more categorical variables from existing variables to get better plots. My main aim was to understand the relationships between quality and the other variables, and what makes a wine get a high quality rating. First of all, it looks like most wines got a medium rating, so it seems very hard to get a high quality rating. I discovered a positive relationship between quality and alcohol, and negative relationships with volatile acidity and chlorides, with a weaker pattern for residual sugar. These relationships could be used to build a prediction model of wine quality.